Abstract



The minimum description length (MDL) principle was developed in the 
context of computational complexity and coding theory. It states that the 
best model to account for some data minimizes the sum of the lengths, 
in bits, of the descriptions of the model and the data as encoded via the 
model. The MDL principle gives a criterion for parameter selection, by 
using the description length as a test statistic. Class I HLA genes play a
major role in the immune response to HIV, and are known to be associated
with rates of progression to AIDS. However, these genes are highly
polymorphic, making it difficult to associate alleles with disease outcome,
given statistical issues of multiple testing. Application of the MDL principle
to immunogenetic data from a longitudinal cohort study (Chicago
MACS) enables classification of alleles associated with plasma HIV RNA
abundance, an indicator of infection progression. We recently reported
that MDL analysis of the relationship of HLA supertypes (a classification
of alleles by epitope-binding anchor motifs) with HIV RNA levels
identifies associations between human genotype and viral RNA. Details of
the MDL approach and more extended analyses of HLA and viral RNA 
are described here. Variation in progression is strongly associated with 
HLA-B. Allele associations with viral levels support and extend previous 
studies. In particular, individuals without B58s supertype alleles average 
viral RNA levels 3.6-fold greater than individuals with them. Mechanisms 
for these associations include variation in epitope specificity and selection 
that favors rare alleles. 

Progression of HIV infection is characterized by three phases: acute, or
early; chronic; and AIDS, the final phase of infection preceding death [1]. The
chronic phase is variable in duration, lasting ten years on average, but varying
from two to twenty years. A good predictor of the duration of the chronic phase
is the viral RNA level during chronic infection, with higher levels consistently
associated with more rapid progression than lower levels [2]. A major challenge
for treating HIV and developing effective vaccination strategies is to understand
what contributes to variation in plasma viral RNA levels, and hence to infection
progression.

The cell-mediated immune response identifies and eliminates infected cells
from an individual. A central role in this response is played by the major
histocompatibility complex (MHC), known in humans as human leukocyte
antigens (HLA). Two classes of HLA genes code for co-dominantly expressed
cell-surface glycoproteins that present processed peptide to circulating T-cells,
which discriminate between self and non-self.

Class I HLA molecules are expressed on all nucleated cells except germ cells. 
In infected cells, they bind and present antigenic peptide fragments to T-cell 
receptors on CD8+ T-lymphocytes, which are usually cytotoxic and cause lysis 
of the infected cell. Class II HLA molecules are expressed on immunogenetically 
reactive cells, such as dendritic cells, B-cells, macrophages, and activated T- 
cells. They present antigen peptide fragments to T-cell receptors on CD4+ 
T-lymphocytes and the interaction results in release of cytokines that stimulate 
the immune response. 

Human HLA loci are among the most diverse known. This diversity
provides a repertoire to recognize evolving antigens. Previous studies of
associations between HLA alleles and variation in progression of HIV-1 infection
have established that within-host HLA diversity helps to inhibit viral infection,
by associating degrees of heterozygosity with rates of HIV disease progression.
Thus, homozygous individuals, particularly at the HLA-B locus, suffer a
greater rate of progression than do heterozygotes. Identifying which alleles
are associated with variation in rates of infection progression has been difficult,
due in part to the compounding of error rates incurred when testing many
alternative hypotheses, and published results do not always agree [10] [11].

This study demonstrates the use of an information-based criterion for sta- 
tistical inference. Its approach to multiple testing differs from that of standard 
analytic techniques, and provides the ability to resolve associations between 
variation in HIV RNA abundance and variation in HLA alleles. 

As an application of computational complexity and optimal coding theory
to statistical inference, the minimum description length (MDL) principle states
that the best statistical model, or hypothesis, to account for some observed
data is the model that minimizes the sum of the number of bits required to
describe both the model and the data encoded via the model. It
is a model-selection criterion that balances the need for parsimony and fidelity,
by penalizing equally for the information required to specify the model and the
information required to encode the residual error.

The analyses detailed below apply the MDL principle to the problem of
partitioning individuals into groups having similar HIV RNA levels, based on
HLA alleles present in each case.

Chicago MACS HLA & HIV Data 

The Chicago Multicenter AIDS Cohort Study (MACS) provided an opportunity
to analyze a detailed, long-term, longitudinal set of clinical HIV/HLA data.
Each participant provided informed consent in writing. Of 564 HIV-positive
cases sampled in the Chicago MACS, 479 provided information about both 
the rate of disease progression and HLA genetic background. Progression was 
indicated by the quasi-stationary "set-point" viral RNA level during chronic 
infection. Immunogenetic background was obtained by determining which HLA 
alleles from class I (HLA-A, -B, and -C) and class II (HLA-DRB1, -DQB1, and 
-DPB1) loci were present in each individual. 

Viral RNA set-point levels were determined after acute infection and prior
to any therapeutic intervention or the onset of AIDS, as defined by the presence
of an opportunistic infection or a CD4+ T-cell count below 200 cells per µl of
blood. Because the assay has a detection threshold of 300 copies of virus per ml
[10], maximum-likelihood estimators were adjusted to avoid biased estimates of
population parameters from a truncated, or censored, sample distribution [15].
Viral RNA levels were log-transformed so as better to approximate a normal
distribution.

High-resolution class I and II HLA genotyping [10] provided four-digit allele
designations, though analyses were generally performed using two-digit allele
designations because of the resulting reduction of allelic diversity and increased
number of samples per allele. Because results could be confounded by an effect
associated with an individual's ethnicity or with a revised sampling protocol,
two separate analyses were performed, one using data from the
entire cohort, and another using only data from Caucasian individuals. Sample
numbers were too small to study other subgroups independently.

HLA supertypes group class I alleles by their peptide-binding anchor motifs
[16]. Assignment of four-digit allele designations to functionally related
groups of supertypes at HLA-A and -B loci facilitated further analysis. Where
they could be determined, HLA-A and HLA-B supertypes were assigned from
four-digit allele designations [10]. As with two-digit allele designations for each
locus, HLA-A and -B supertypes were assessed for association with viral RNA
levels. Cases having other alleles were withheld from classification and subsequent
analysis of supertypes.

A description length analysis determined whether HIV RNA levels were non- 
trivially associated with alleles at any HLA locus. 

Description Lengths 

The challenge of data classification is to find the best partition, such that
observations within a group are well-described as independent draws from a single
population, but differences in population distributions exist between groups.
Whether the data are better represented as two groups, or more, than as one
depends on the description lengths that result.

We use the family of Gaussian distributions to model viral RNA levels.
While the MDL strategy can be applied using any probabilistic model, a log-normal
distribution is a good choice for the observed plasma viral RNA values.
First, the description length of the model and of the data given the model is
calculated as described below, grouping all of the observations into one normal
distribution; this gives $L_1$. Next, the log-RNA values are divided into two
groups according to the HLA alleles present, and the assignment of alleles to
groups is chosen to minimize the description length, $L_2$, under the constraint
that two Gaussian distributions, each having its own mean and variance, are
used to model the data.

For a fixed $n \times n$ covariance matrix $\Sigma$, the description length is
$L_\Sigma = \frac{1}{2} \log |\Sigma| + \frac{1}{2} Y' \Sigma^{-1} Y + C$,
where $Y$ is the n-component vector of observations and $C$ is
the quantity of information required to specify the partition. Logarithms are
computed in base two, with fractional values rounded upwards, so that the
resulting units are bits. The description length of interest results from integrating
$L_\Sigma$ over all covariance matrices with the appropriate structure. In practice, we
use Laplace's approximation for the integral, which gives, asymptotically,
$L = \frac{1}{2} \log |\hat\Sigma| + \frac{1}{2} Y' \hat\Sigma^{-1} Y + \frac{k}{2} \log n + C$,
where $k$ is the number of free
parameters in the covariance model, and $\hat\Sigma$ is the specific covariance matrix of
the appropriate structure that minimizes $L_\Sigma$. A more detailed account appears
in the Appendix.

The analog of a null hypothesis is the assumption that one group of alleles
is sufficient to account for the variation in viral RNA. The description length
for one group is:
$L_1 = \frac{1}{2} \left( n + (n - 1) \log s^2 + \log n \bar{x}^2 + 2 \log n \right)$,
where $n$ is the
total number of observations, $s^2$ is the maximum-likelihood estimate of the
population variance and $\bar{x}$ is the sample mean, computed as the Winsorized
mean [15] because of truncation below the sensitivity limit of the RNA assay.

It follows that the description length for two groups can be computed as:

$$L_2 = \frac{1}{2} \sum_{i=1}^{2} \left( n_i + (n_i - 1) \log s_i^2 + \log n_i \bar{x}_i^2 + 2 \log n_i \right) + C,$$

where $n_i$, $s_i^2$, and $\bar{x}_i$ are the number of observations, the variance
estimate, and the mean for group $i$, and $C$ is an adjustment for performing
multiple comparisons. Because additional information is required to specify the
optimum partition, the description
length is increased by a quantity related to the number of partitions evaluated,
such that $C = N \log k$ bits, where $N$ is the number of alleles observed at the
partitioned locus. For $k = 2$, $C = N$.
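
The formulas for $L_1$ and $L_2$ can be sketched in a few lines of Python. This is a minimal illustration, not the C implementation used for the analyses: the function names `group_bits` and `partition_bits` are hypothetical, the plain sample mean stands in for the Winsorized mean, the variance uses the $(n - 1)$ denominator that appears in Case 1 of the Appendix, and the data are invented.

```python
import math

def group_bits(values):
    """Description length, in bits, of one Gaussian group (the bracketed term of L_1)."""
    n = len(values)
    mean = sum(values) / n  # plain mean; the paper uses a Winsorized mean
    # Variance with an (n - 1) denominator, as in Case 1 of the Appendix.
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 0.5 * (n + (n - 1) * math.log2(s2)
                  + math.log2(n * mean ** 2) + 2 * math.log2(n))

def partition_bits(groups, n_alleles):
    """L_k: the per-group terms plus the penalty C = N log2(k)."""
    k = len(groups)
    penalty = n_alleles * math.log2(k)  # zero when k = 1, N when k = 2
    return sum(group_bits(g) for g in groups) + penalty

# Hypothetical log-RNA values, invented for illustration only.
low = [2.0, 2.3, 1.8, 2.1]
high = [5.0, 5.4, 4.7, 5.2]
L1 = partition_bits([low + high], n_alleles=3)
L2 = partition_bits([low, high], n_alleles=3)
print(math.ceil(L1), math.ceil(L2))  # fractional bits are rounded upwards
```

With these invented values the two-group description is shorter, so the split would be preferred. Because the lengths are computed from probability densities of continuous values, a group with very small variance can contribute a negative term.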

Further partitions of alleles into more than two groups might yield a shorter
description length, computed as a summation over terms in the equation for $L_2$
for each of the $k$ distinct groups.

The shortest description length for any value of $k$ indicates the best choice of
model parameters, including the number of parameters, and hence, the optimum
partition of $N$ alleles into $k$ groups. We denote this as $L^*$.

Algorithm

The minimum description length is found by iteratively computing the description
length for each possible partition of alleles into groups and taking the minimum
as optimal. Iteration consists first of determining the number of alleles, $N$,
at a particular locus, and then incrementing through each of the $k^N$ possible
assignments of alleles to $k$ groups, computing the associated description length,
and reporting the best results. Each iteration evaluates one possible mapping
of alleles to groups. Because the search is exhaustive, the partition with the
shortest description length is guaranteed to be found.

In this mapping, the ordering of groups is informative, because the ordering 
gives the relative dominance of alleles for diploid loci. An individual having an 
allele assigned to the first-order group is assigned to that group. Otherwise, the 
individual is assigned to the next appropriate group. Two individuals sharing
one allele might be placed in either the same group or different groups, depending
on the mapping of alleles to groups in a particular iterate. For example,
consider how one might group two individuals, one with alleles A1 and A2 at
some locus, and another with alleles A2 and A3. Whether or not they are
grouped together depends on the assignment of alleles to groups, and can be
done several different ways. The algorithm enumerates each possible assignment 
of alleles to groups. 
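
The enumeration and the dominance rule can be sketched as follows. This is a small Python illustration under stated assumptions, not the authors' C program: `assign_group` and `best_partition` are hypothetical names, the alleles P, Q, and R and the log-RNA values are invented, the plain mean replaces the Winsorized mean, and mappings that leave any group with fewer than two individuals are simply skipped because a variance cannot be estimated for them.

```python
import itertools
import math

def group_bits(values):
    """Per-group description length in bits (bracketed term of L_1)."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 0.5 * (n + (n - 1) * math.log2(s2)
                  + math.log2(n * mean ** 2) + 2 * math.log2(n))

def assign_group(genotype, allele_to_group):
    """Ordered groups: an individual joins the first (lowest-numbered) group
    to which either of its two alleles is mapped."""
    return min(allele_to_group[a] for a in genotype)

def best_partition(genotypes, values, k=2):
    """Try all k**N mappings of N alleles to k ordered groups; keep the shortest."""
    alleles = sorted({a for g in genotypes for a in g})
    best_bits, best_mapping = math.inf, None
    for labels in itertools.product(range(k), repeat=len(alleles)):
        mapping = dict(zip(alleles, labels))
        groups = [[] for _ in range(k)]
        for genotype, value in zip(genotypes, values):
            groups[assign_group(genotype, mapping)].append(value)
        if any(len(g) < 2 for g in groups):
            continue  # sketch-only rule: skip groups too small for a variance
        bits = sum(group_bits(g) for g in groups) + len(alleles) * math.log2(k)
        if bits < best_bits:
            best_bits, best_mapping = bits, mapping
    return best_bits, best_mapping

# Invented example: carriers of allele "P" have lower log-RNA values.
genotypes = [("P", "Q"), ("P", "R"), ("P", "P"), ("Q", "R"),
             ("Q", "Q"), ("R", "R"), ("Q", "R"), ("P", "Q")]
values = [2.0, 2.3, 1.8, 5.0, 5.4, 4.7, 5.2, 2.1]
bits, mapping = best_partition(genotypes, values)
print(mapping)
```

In this toy case the search places P alone in the first-order group, so every carrier of P is assigned to the low group regardless of the second allele.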

The extent of the search scales as $k^N$. In practice, the most diverse locus
was HLA-B, with 30 alleles when analyzed using two-digit allele designations.
For two groups, this gives $2^{30} \approx 10^9$ possible partitions. Serial iteration on an
UltraSPARC-IIi 440 MHz CPU (Sun Microsystems) requires roughly 36 hours
for completion. A parallel implementation requires no message passing, so computing
time scales inversely with the number of CPUs; that is, doubling the
available processors halves the time for iteration. With many CPUs, the search
space of $2^{30}$ partitions can be exhaustively evaluated in an hour or less.
Unfortunately, exhaustively evaluating all three-way partitions is prohibitive, as
$3^{30} \approx 2 \times 10^{14}$, roughly a 200,000-fold increase in computational effort.
Supertype classification reduced the diversity of possible partitions and enabled
partitioning of the data into more than two groups.
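
The size of the search space is easy to check directly. The short Python calculation below evaluates $k^N$ for $N = 30$ and extrapolates the serial running time; the extrapolation assumes, purely for illustration, that each three-way partition costs the same as a two-way partition.

```python
N = 30
two_way = 2 ** N      # mappings of 30 alleles to 2 ordered groups
three_way = 3 ** N    # mappings of 30 alleles to 3 ordered groups
ratio = three_way / two_way  # equals 1.5 ** 30

serial_hours_two_way = 36  # serial running time quoted in the text
# Assumption: equal cost per partition for k = 2 and k = 3.
serial_years_three_way = serial_hours_two_way * ratio / (24 * 365)

print(f"2^30 = {two_way:.3g}, 3^30 = {three_way:.3g}")
print(f"ratio = {ratio:.3g}, serial time = {serial_years_three_way:.0f} years")
```

The ratio is about $1.9 \times 10^5$, which on a single CPU of the kind described would correspond to centuries of computation.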

The algorithm was implemented in C and will be distributed on request. 

Class I & II HLA Results 

The description length for the entire cohort as one group is $L_1 = 934$ bits; for
the Caucasian subsample, it is $L_1 = 721$ bits. In general, $L_1 < L_2$ at most loci
(Table 1), so the MDL criterion does not support partitioning alleles into groups
that are predictive of high or low RNA levels, except at HLA-B, where $L_2 < L_1$.
In the subsample, partitioning HLA-C or HLA-DQB1 alleles also provides
two-way splits that are preferred to a single group, though by a smaller margin
than at HLA-B. Further partitioning was
intractable because of great allelic diversity, as previously mentioned. Partitions
of HLA-B alleles provide the best groupings among all loci. Because $L_2 < L_1$,
two groups, partitioned by HLA-B alleles, provide a better description than one
(Fig. 1a and 1b).

What is the composition of the optimum groupings? For the entire cohort,
the following alleles were associated with low viral RNA levels: B*13, B*27,
B*38, B*45, B*49, B*57, B*58, and B*81. The remaining alleles, associated
with greater viral RNA than the first group, are: B*07, B*08, B*14, B*15, B*18,
B*35, B*37, B*39, B*40, B*41, B*42, B*44, B*47, B*48, B*50, B*51, B*52,
B*53, B*55, B*56, B*67, and B*82. As described earlier, having any allele
associated with the first group is sufficient for an individual to be assigned to
the group having lower viral RNA.

How robust are these assignments of alleles to groups? Four alternative
groupings provide description lengths within one bit of the optimum. They do
not dramatically rearrange the assignment of individuals to groups, but do provide
insight as to which alleles are assigned to either group with less confidence.
Among near-optimal partitions, alleles B*82 and B*67 were assigned to groups
other than in the optimum partition.

In the Caucasian subsample, alleles B*13, B*27, B*40, B*45, B*48, B*49, 
B*57, and B*58 are associated with lower viral RNA, and the remaining alleles, 
B*07, B*08, B*14, B*15, B*18, B*35, B*37, B*38, B*39, B*41, B*44, B*47, 
B*50, B*51, B*52, B*53, B*55, and B*56, or lack of any alleles from the first 
group, are associated with greater viral RNA levels. Two nearly optimal partitions
assigned alleles B*47 and B*48 to the second group. Fig. 1 illustrates the
distributions of viral RNA levels from this subsample, as one group (Fig. 1c)
and as the best partition at HLA-B (Fig. 1d).

To summarize the most robust inferences from the analyses of two-digit allele 
designations, individuals having HLA-B alleles B*13, B*27, B*45, B*49, B*57, 
or B*58 were associated with lower viral RNA levels than their counterparts 
lacking these alleles. 

The groupings obtained via the MDL approach compared very favorably with more
traditional means of statistical inference: a two-tailed, two-sample, Welch modified
t-test, which does not assume equal variances, and a non-parametric alternative,
the Wilcoxon rank-sum test. In each case, the null hypothesis
was that of no difference between the group mean log-transformed viral
RNA levels, and the alternative hypothesis was that the means differ. Both tests
agreed in rejecting the null hypothesis in favor of the alternative ($P < 10^{-10}$).
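
For readers unfamiliar with the Welch modification, the following Python sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t` and the data are hypothetical; the sketch stops short of a P value, which requires the t distribution and is not reproduced here.

```python
import math
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic and approximate degrees of freedom.
    Each group keeps its own variance; no pooling is performed."""
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Invented log-RNA values for a low group and a high group.
low = [2.0, 2.3, 1.8, 2.1]
high = [5.0, 5.4, 4.7, 5.2]
t, df = welch_t(low, high)
print(f"t = {t:.2f}, df = {df:.2f}")
```

The degrees of freedom always fall between the smaller group size minus one and the pooled value $n_1 + n_2 - 2$, which is how the test accommodates unequal variances.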

HLA Supertype Results 

Assigning the diploid, co-dominantly expressed HLA-A alleles to four HLA-A
supertypes [16], A1s, A2s, A3s, and A24s, was possible for 399 individuals. The
mapping of HLA-B alleles to five supertypes, B7s, B27s, B44s, B58s, and B62s,
was made for 352 individuals. The resulting decrease in allelic diversity enabled
analysis for $k > 2$.

Description lengths of the best k-way partitions of supertype alleles for HLA-A
supertypes are: $L_1 = 793$, $L_2 = 782$, $L_3 = 789$, and $L_4 = 794$ bits. The best
description length results from a two-way split, though a three-way split also
yields a shorter description length than that obtained from one group. The best
partition of HLA-A supertypes assigned individuals having A1s alleles to the
low RNA group.

For HLA-B supertypes, $L_1 = 704$, $L_2 = 691$, $L_3 = 693$, and $L_4 = 697$ bits
(Fig. 1e). The best model results when $k = 2$. Overall, individuals lacking B58s
alleles averaged viral RNA levels 3.6-fold greater than individuals having B58s
supertype alleles (Fig. 1f). Thus, individuals with B58s alleles have significantly
lower viral RNA levels than individuals without them.

Table 2 summarizes results of assigning HLA-B associations to high or low 
viral-RNA categories as two-digit allele designations from both the entire cohort 
and the Caucasian subsample, and as supertypes for those individuals having 
two alleles that could be assigned to a supertype. Alleles not found in a sample 
are indicated by a dash. The B*15 alleles are not shown because their high- 
resolution genotype designations correspond to four different supertypes. 

Overall, the most consistent associations with low viral RNA are among the
B58s, and with high viral RNA, the B7s. Inconsistencies in assignment to a
category occur for the B*13, B*27, B*45, and B*49 alleles, which are in the low
viral-RNA group when analyzed as such, but the high viral-RNA group when
assigned to supertypes.

When compared with alternative inferential techniques, the difference be- 
tween group viral RNA levels was highly significant. This and agreement with 
alleles reported to be associated with variation in viral RNA levels in previously 
published studies indicate that using the description length as a test statistic 
can provide reliable inferences. 

MDL & Statistical Inference 

The traditional statistical solution is to pose a question as follows: suppose that 
the simpler model (e.g., one homogeneous population) were actually true; call 
this the null hypothesis. How often would one, in similar experiments, get data 
that look as different from that expected under the null hypothesis as in the 
actual experiment? 

This technique has limitations when the partition that represents the alternative
hypothesis is not given in advance. There are then many potential
alternative partitions and the appropriate distribution under the null hypothesis
for this ensemble of tests is very difficult to estimate. Furthermore, for proper
interpretation, the outcome relies upon the truth of the initial assumption: that
the data are distributed as dictated by the null hypothesis.

An alternative is to choose the model that represents the data most efficiently.
Here, efficiency is the amount of information, quantified as bits, required
to transmit electronically both the model and the data as encoded by the model.
This criterion may not seem intuitively clear on first exposure. However, it follows
naturally from a profound relationship between probability and coding theory
that was discovered, explored, and elaborated by Solomonoff, Kolmogorov,
Chaitin, and Rissanen.

The idea is quite simple and elegant. It can be illustrated by analogy to the
problem of designing an optimal code for the efficient transmission of natural-language
messages. Consider the international Morse code. Recall that Morse
code assigns letters of the Roman alphabet to codewords composed of dots
("•") and dashes ("—"). The codewords do not all have the same number of
dots and/or dashes; it is a variable-length code.

Efficient, compact encodings result from the design of a codebook such that
the shortest codewords are assigned to the most frequently encoded letters and
long codewords are assigned to rare letters. Thus, e and t are encoded as "•"
and "—", respectively, while q and j are encoded as "— — • —" and "• — — —".
The theory of optimal coding provides an exact relationship between frequency
and code length and thus, probability and description length.
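
The relationship in question is that an ideal codeword for a symbol with probability $p$ has length $-\log_2 p$ bits. The Python sketch below applies it to a toy alphabet; the function name `code_lengths` and the letter counts are invented for illustration and are not real English letter frequencies.

```python
import math

def code_lengths(counts):
    """Ideal codeword length, in whole bits, for each symbol:
    ceil(-log2(p)), where p is the symbol's relative frequency."""
    total = sum(counts.values())
    return {sym: math.ceil(-math.log2(c / total)) for sym, c in counts.items()}

# Hypothetical counts chosen so that the probabilities are powers of 1/2.
counts = {"e": 8, "t": 4, "q": 2, "j": 2}
lengths = code_lengths(counts)
print(lengths)

# Kraft sum: a prefix-free binary code with these lengths exists if it is <= 1.
kraft = sum(2.0 ** -n for n in lengths.values())
print(kraft)
```

Frequent symbols receive short codewords and rare symbols long ones, which is the same trade-off that Morse code makes by hand.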

The key departure of MDL from Morse-code-like schemes is that, while Morse
code would generally be good for sending messages over an average of many 
texts, specific texts might be encoded even more efficiently, by encoding not 
only letters, but letter combinations, common words, or even phrases, perhaps 
as abbreviations or acronyms. However, if one is to recode for particular texts, 
one must first transmit the coding scheme. So perhaps one might use Morse 
code to transmit the details of the new coding scheme and then transmit the 
text itself with the new scheme. Whether this might yield greater efficiency 
depends not only on how much compression is achieved in the new encoding, 
but also on how much overhead is incurred in having to transmit the coding 
scheme. 

The analogy to scientific data analysis is clear. A statistical model is an
encoding scheme that encapsulates the regularities in the data to yield a concise
representation thereof. The best model effectively compresses regularities in
the data, but is not so elaborate that its own description demands a great deal
of information to be encoded. The MDL principle provides a model-selection
criterion that balances the need for a model that is both appropriate and
parsimonious, by penalizing with equal weights the information required to specify
the model and the unexplained, or residual, error.

Yet another contribution the MDL principle brings to statistical modelling 
is that the penalty for multiple comparisons is less restrictive than the penalty 
of compounded error rates incurred with canonical inferential approaches. In 
order to maintain a desired experiment-wide error rate, the standard adjustment 
is to make the per-comparison error rate considerably more stringent. With 
current technology, realistic sample sizes for such studies will generally be less 
than a thousand and stringent significance levels will be difficult to surpass. 
Unfortunately, fixing the false-positive error rate does not address the false- 
negative probability, which may leave researchers powerless to detect effects 
among many competing hypotheses with limited samples. 
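
To give a sense of scale for the standard adjustment, the sketch below applies a Bonferroni correction, which is one common way to hold the experiment-wide error rate fixed. The text does not name a specific correction, so the choice of Bonferroni, the function name `bonferroni_threshold`, and the value of 0.05 are assumptions made only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison significance level that keeps the experiment-wide
    false-positive rate at or below alpha across n_tests comparisons."""
    return alpha / n_tests

alpha = 0.05            # assumed experiment-wide error rate
n_partitions = 2 ** 30  # two-way partitions of 30 alleles
per_test = bonferroni_threshold(alpha, n_partitions)
print(f"{per_test:.3g}")
```

A threshold this small is what makes detection difficult with samples of a few hundred individuals.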

Mechanisms 

Of HLA supertype alleles, individuals with B58s have lower viral RNA levels
than those who lack them, even among homozygous individuals. Naturally,
this leads one to consider mechanisms that underlie patterns found in the data.
Elsewhere, we consider two hypotheses to explain the observed associations
between HLA alleles and variation in viral RNA [10].

There may be allele-specific variation in antigen-binding specificity. Some
alleles may have greater affinity than others for HIV-specific peptide fragments
due to the peptide-binding anchor motifs they present. We were not able to
identify any clear association between the frequency of anchor motifs among
HIV-1 proteins and viral RNA levels in the Chicago MACS [10], though others
have suggested that such a relationship might exist [23].

It may also be the case that frequency-dependent selection has favored rare
alleles. Frequent alleles provide the evolving pathogen greater opportunity to
explore mutant phenotypes that may escape detection by the host's immune
response. By encountering rare alleles less frequently, the virus has not had the
same opportunity to explore mutations that evade the host's defense response.
This hypothesis is corroborated by a significant association between viral RNA
and HLA allele frequency in the Chicago MACS sample [10].

Because their predictions differ, these hypotheses could be tested with data
from another cohort, where a different viral subtype predominates. That is,
if alleles other than those identified in this study were associated with low
viral RNA there, and an association between rare alleles and low viral RNA levels
were observed, then the second hypothesis would be more viable than
the first. Alternatively, if a clear association between antigen peptide-binding
anchor motifs and variation in viral RNA levels were found, the first hypothesis
would be more viable. Other mechanisms are also possible, and hypotheses by
which to evaluate them merit consideration.

Appendix

In Gaussian Process modeling [25], the population means are treated as random 
variables and integrated out of the likelihood. The model is then specified 
entirely by the structure of the covariance matrix $\Sigma$, which specifies how each
pair of observations is correlated. The covariance is greater for two observations 
from the same partition than for two observations from different partitions. Any 
given partition is specified entirely by a corresponding covariance structure. 

Partitioning with Gaussian Models. Denote the n observations as the
vector $Y$ and the covariance matrix with parameter vector $\theta$ by $\Sigma(\theta)$. Let
the number of components of $\theta$ (the number of free parameters in the covariance
matrix) be $k$. Then the MDL for the given covariance structure is:
$L = \frac{1}{2} \log |\Sigma(\hat\theta)| + \frac{1}{2} Y' \Sigma(\hat\theta)^{-1} Y + \frac{k}{2} \log n + C$,
where $C$ is the information required
to specify the partition or, equivalently, the covariance structure, and $\hat\theta$
is the vector of covariance parameters evaluated at maximum likelihood.

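One Gaussian Population. The covariance matrix has a component $\sigma_m^2$
for the covariance among observations, induced by their sharing an unspecified
mean, and an error component $\sigma_e^2$:
$\Sigma = \sigma_e^2 I + \sigma_m^2 \mathbf{1}\mathbf{1}'$, with $\mathbf{1}$ the column vector
of all ones, $\mathbf{1}\mathbf{1}'$ the matrix of all ones, and $I$ the identity matrix. The inverse is:

$$\Sigma^{-1} = \sigma_e^{-2} \left( I - \frac{\sigma_m^2}{\sigma_e^2 + n \sigma_m^2} \mathbf{1}\mathbf{1}' \right)$$

and the log-determinant: $\log |\Sigma| = (n - 1) \log \sigma_e^2 + \log(\sigma_e^2 + n \sigma_m^2)$.
This gives $L = \frac{1}{2} \left( n + (n - 1) \log \sigma_e^2 + \log(\sigma_e^2 + n \sigma_m^2) + 2 \log n \right)$.
We find the maximum likelihood values of the parameters by minimizing
over the description lengths. There are two cases.

Case 1: $n^2 \bar{Y}^2 - Y'Y > 0$. Here we have
$\hat\sigma_e^2 = (n - 1)^{-1} (Y'Y - n \bar{Y}^2)$ and
$\hat\sigma_m^2 = (n - 1)^{-1} (n \bar{Y}^2 - \frac{1}{n} Y'Y)$, so
$L = \frac{1}{2} \left( n + (n - 1) \log \hat\sigma_e^2 + \log n \bar{Y}^2 + 2 \log n \right)$.

Case 2: $n^2 \bar{Y}^2 - Y'Y < 0$. Here the common mean vanishes, giving
$\hat\sigma_e^2 = \frac{1}{n} Y'Y$, $\hat\sigma_m^2 = 0$, so
$L = \frac{n}{2} \left( 1 + \log \hat\sigma_e^2 + \frac{1}{n} \log n \right)$.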
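
The log-determinant and the $\log n \bar{Y}^2$ term of Case 1 can be verified in two short steps. The derivation below is a reader's check, not part of the original account; it writes $\sigma_e^2$ for the error variance and $\sigma_m^2$ for the covariance induced by the shared mean, and uses only the covariance structure and the Case 1 estimators given in this Appendix.

```latex
% Eigenvalues of \Sigma = \sigma_e^2 I + \sigma_m^2 \mathbf{1}\mathbf{1}':
\Sigma \mathbf{1} = (\sigma_e^2 + n\sigma_m^2)\,\mathbf{1}, \qquad
\Sigma v = \sigma_e^2\, v \quad \text{for every } v \perp \mathbf{1}
\;\Longrightarrow\;
\log|\Sigma| = (n-1)\log\sigma_e^2 + \log(\sigma_e^2 + n\sigma_m^2).

% The Case 1 estimators combine to give the argument of \log n\bar{Y}^2:
\hat\sigma_e^2 + n\hat\sigma_m^2
  = \frac{Y'Y - n\bar{Y}^2}{n-1} + \frac{n^2\bar{Y}^2 - Y'Y}{n-1}
  = \frac{(n^2 - n)\,\bar{Y}^2}{n-1}
  = n\bar{Y}^2 .
```

The condition defining Case 1, $n^2 \bar{Y}^2 - Y'Y > 0$, is exactly the requirement that the estimate of $\sigma_m^2$ be positive.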

Many Gaussian Populations. Two partitions give two populations. To
analyze the HLA/HIV data, we treated these populations as independent. That
is, we take the covariance between observations in separate partitions to be zero,
and apply the fitting procedure outlined above separately to the two populations.
An alternative is to take non-zero covariance between the two populations.
This results in a more elaborate estimation procedure, unlikely to yield
large efficiency gains because the two degrees of freedom (population means)
are essentially mixed into one, with residual error.

The procedure examines each admissible partition and computes the MDL
for that partition as the sum of individual description lengths over the two
independent populations. The best partition yields the lowest description length
over all partitions. This, plus the cost of specifying the partition, is compared
with the MDL from the unpartitioned data. If the best partition provides a
better representation of the data than the unpartitioned set ($L_k < L_{k-1}$), then
the process is repeated in a recursive manner, independently within each of the
partitioned populations.

Figure Legends



Fig. 1. Description-length comparisons of viral RNA distributions as one ($L_1$)
or two ($L_2$) groups. Ordinate units are the expected number of observations
between two tick marks over the abscissa, or one doubling of viral RNA. Impulses
along the abscissa show individual observations, with jitter added to enhance
rendering of identical values. (a) Observations ($n$) from the Chicago MACS
cohort lumped into one group, and (b) split into the best partition as two groups,
with individuals having alleles B*13, B*27, B*38, B*45, B*49, B*57, B*58, or
B*81 assigned to the lower group ($n_1$), and remaining individuals assigned to
the group with greater viral RNA ($n_2$). (c) Observations from the Caucasian
subsample as one group, and (d) as the best split into two groups, where having
alleles B*13, B*27, B*40, B*45, B*48, B*49, B*57, or B*58 was the criterion
for being assigned to the low viral-RNA group. Observations from individuals
having two HLA-B supertype alleles, (e) in one group, and (f) partitioned into
two groups, contingent on the presence of B58s.

Table 1: Optimum two-way partitions at each locus, with per-locus allelic
diversity (N), description lengths without the information cost to specify model
parameters (L_2 - C), and minimum description lengths (L_2).

             Entire Cohort              Caucasian Subsample
             n = 479, L_1 = 934         n = 379, L_1 = 721
Locus        N     L_2 - C   L_2        N     L_2 - C   L_2

Class I
  HLA-A      19    916       935        18    703       721
  HLA-B      30    887       917*       26    681       707*
  HLA-C      14    921       935        13    706       719

Class II
  DRB1       13    927       940        13    711       724
  DQB1        5    936       941         5    715       720
  DPB1       24    927       951        21    710       731

Table 2: HLA-B alleles associated with low (o) or high (•) viral RNA levels.

             Entire      Caucasian    Supertypes
Allele       Cohort      Subsample    Only
             n = 479     n = 379      n = 352

B7s
  B*07       •           •            •
  B*35       •           •            •
  B*51       •           •            •
  B*53       •           •            •
  B*55       •           •            •
  B*56       •           •            •
  B*67       o/•         -            •

B27s
  B*14       •           •            •
  B*27       o           o            •
  B*38       o           •            •
  B*39       •           •            •
  B*48       o/•         o/•          •

B44s
  B*18       •           •            •
  B*37       •           •            •
  B*40       •           o            •
  B*41       •           •
  B*44       •           •
  B*45       o           o
  B*49       o           o
  B*50       •           •

B58s
  B*57       o           o            o
  B*58       o           o            o

B62s
  B*13       o           o            •
  B*52       •           •            •

Other
  B*08       •           •
  B*15       •           •
  B*42       •
  B*47       •           o/•
  B*81       o
  B*82       o/•